Hadoop-BAM: directly manipulating next generation sequencing data in the cloud

Authors

  • Matti Niemenmaa
  • Aleksi Kallio
  • André Schumacher
  • Petri Klemelä
  • Eija Korpelainen
  • Keijo Heljanko
Abstract

Hadoop-BAM is a novel library for the scalable manipulation of aligned next-generation sequencing data in the Hadoop distributed computing framework. It acts as an integration layer between analysis applications and BAM files that are processed using Hadoop. Hadoop-BAM solves the issues related to BAM data access by presenting a convenient API for implementing map and reduce functions that can directly operate on BAM records. It builds on top of the Picard SAM JDK, so tools that rely on the Picard API are expected to be easily convertible to support large-scale distributed processing. In this article we demonstrate the use of Hadoop-BAM by building a coverage summarizing tool for the Chipster genome browser. Our results show that Hadoop offers good scalability, and one should avoid moving data in and out of Hadoop between analysis steps.
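The map-and-reduce approach to coverage summarization mentioned above can be illustrated with a minimal sketch. This is not the Hadoop-BAM API and not the authors' tool: it is plain in-memory Python, and the tuple record layout `(ref, start, end)`, the function names, and the `BIN_SIZE` value are all assumptions made for illustration. The map step splits each aligned read into per-bin contributions, and the reduce step sums them into a mean depth per bin.

```python
from collections import defaultdict

BIN_SIZE = 100  # hypothetical summary resolution, in bases


def map_alignment(ref, start, end, bin_size=BIN_SIZE):
    """Map step: emit ((ref, bin_index), covered_bases) for one aligned read.

    Coordinates are assumed 0-based, with start inclusive and end exclusive.
    """
    pos = start
    while pos < end:
        bin_index = pos // bin_size
        bin_end = (bin_index + 1) * bin_size
        covered = min(end, bin_end) - pos  # bases of this read inside the bin
        yield (ref, bin_index), covered
        pos += covered


def reduce_coverage(pairs, bin_size=BIN_SIZE):
    """Reduce step: sum covered bases per key and convert to mean depth."""
    totals = defaultdict(int)
    for key, covered in pairs:
        totals[key] += covered
    return {key: total / bin_size for key, total in totals.items()}


def summarize(alignments, bin_size=BIN_SIZE):
    """Run map then reduce over an iterable of (ref, start, end) tuples."""
    pairs = (
        pair
        for ref, start, end in alignments
        for pair in map_alignment(ref, start, end, bin_size)
    )
    return reduce_coverage(pairs, bin_size)
```

In a real Hadoop job the framework would group the emitted pairs by key between the two steps and distribute the work across nodes; here a dictionary stands in for that shuffle.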

Similar resources

SparkSeq: fast, scalable, cloud-ready tool for the interactive genomic data analysis with nucleotide precision

Many time-consuming analyses of next generation sequencing data can be addressed with modern cloud computing. The Apache Hadoop-based solutions have become popular in genomics due to their scalability in a cloud infrastructure. So far, most of these tools have been used for batch data processing rather than interactive data

CloudDOE: A User-Friendly Tool for Deploying Hadoop Clouds and Analyzing High-Throughput Sequencing Data with MapReduce

BACKGROUND Explosive growth of next-generation sequencing data has resulted in ultra-large-scale data sets and ensuing computational problems. Cloud computing provides an on-demand and scalable environment for large-scale data analysis. Using a MapReduce framework, data and workload can be distributed via a network to computers in the cloud to substantially reduce computational latency. Hadoop/...

NGSUtils: a software suite for analyzing and manipulating next-generation sequencing datasets

SUMMARY NGSUtils is a suite of software tools for manipulating data common to next-generation sequencing experiments, such as FASTQ, BED and BAM format files. These tools provide a stable and modular platform for data management and analysis. AVAILABILITY AND IMPLEMENTATION NGSUtils is available under a BSD license and works on Mac OS X and Linux systems. Python 2.6+ and virtualenv are requir...

Using Bioinformatics Applications on the Cloud

Dealing with large genomic data on a limited computing resource has been an inevitable challenge in life science. Bioinformatics applications have required high performance computation capabilities for next-generation sequencing (NGS) data and the human genome sequencing data with single nucleotide polymorphisms (SNPs). From 2008, Cloud computing platforms have been widely adopted to deal with ...

Strategies and Clinical Applications of Next Generation Sequencing

Abstract DNA sequencing is one of the great valuable techniques in molecular biology, which can be used to detect the sequence of nucleotides in a DNA fragment. The high-throughput sequencing known as Next Generation Sequencing (NGS) revolutionized genomic research and molecular biology; therefore, the whole human genome can be sequenced with a low cost in several days. NGS technology is simi...

Journal title:

Volume 28, issue

Pages -

Publication date: 2012